\(^1\)

\(^2\) GenomEast platform, IGBMC

1 Introduction

During this training session, we are going to analyze data from Laurette et al. (Laurette et al. 2015) which data are deposited in GEO as GSE61967. There are ChIP-seq and RNA-seq data. Data were downloaded from GEO and processed.

1.1 Processing of RNA-seq data

1.1.1 Preprocessing

Reads were preprocessed in order to remove adapter, polyA and low-quality sequences (Phred quality score below 20). After this preprocessing, reads shorter than 40 bases were discarded for further analysis. These preprocessing steps were performed using cutadapt (Martin 2011) version 1.10.

1.1.2 Mapping

Reads were mapped onto the hg38 assembly of Homo sapiens genome using STAR (Dobin et al. 2013) version 2.5.3a.

1.1.3 Quantification

Gene expression quantification was performed from uniquely aligned reads using htseq-count verson 0.6.1.p1 (S. Anders, Pyl, and Huber 2015), with annotations from Ensembl version 103 and “union” mode. Only non-ambiguously assigned reads have been retained for further analyses.

1.1.4 Normalization

Read counts have been normalized across samples with the median-of-ratios method proposed by Anders and Huber (Anders and Huber 2010), to make these counts comparable between samples.

1.2 Processing of ChIP-seq data

1.2.1 Mapping

Reads were mapped to Homo sapiens genome (assembly hg38) using Bowtie (Langmead et al. 2009) v1.0.0 with default parameters except for “-p 3 -m 1 –strata –best –chunkmbs 128.” The following table shows the number of reads aligned to the Homo sapiens genome.

Mapping statistics of ChIP-seq data analyzed during this training session. Raw corresponds to the number of sequenced reads. Aligned corresponds to the number of reads aligned exactly 1 time. Multimapped corresponds to the number of reads aligned > 1 times. Unmapped corresponds to the number of reads aligned 0 time.
Sample ID Sample name raw reads aligned reads multimapped reads unmapped reads
SRR1594290 MITF_CHIP-seq 84856252 54369181 (64.07%) 12823841 (15.11%) 17663230 (20.82%)
SRR1594291 SOX10_CHIP-seq 38202152 29594694 (77.47%) 6128890 (16.04%) 2478568 (6.49%)
SRR1594292 BRG1siCTRL_CHIP-seq 50190992 40134919 (79.96%) 6403809 (12.76%) 3652264 (7.28%)
SRR1594293 BRG1siMITF_CHIP-seq 64765675 44480241 (68.68%) 9357361 (14.45%) 10928073 (16.87%)
SRR1594294 GFPsiCTRL_CHIP-seq 38148419 26273410 (68.87%) 5722823 (15.00%) 6152186 (16.13%)
SRR1594295 MITF_Input 29433042 19970925 (67.85%) 4049711 (13.76%) 5412406 (18.39%)
SRR1594296 SOX10_Input 35449561 27381422 (77.24%) 7045192 (19.87%) 1022947 (2.89%)
SRR1596499 BRG1siSOX10_CHIP-seq 42745544 34418403 (80.52%) 6855741 (16.04%) 1471400 (3.44%)

1.2.2 Peak Calling

Prior to peak calling, reads falling into Encode blacklisted regions (“(2014) Mod/Mouse/humanENCODE: Blacklisted Genomic Regions for Functional Genomics Analysis - Anshul Kundaje n.d.) were removed using bedtools intersect v2.26.0 (Quinlan and Hall 2010). Then peak calling was done with Macs2 v2.1.1 with default parameters.

Number of peaks detected
IP sample Input sample No. of peaks
BRG1siCTRL_CHIP-seq GFPsiCTRL_CHIP-seq 72024
BRG1siMITF_CHIP-seq GFPsiCTRL_CHIP-seq 33984
BRG1siSOX10_CHIP-seq GFPsiCTRL_CHIP-seq 73267
MITF_CHIP-seq MITF_Input 9702
SOX10_CHIP-seq SOX10_Input 4538

1.2.3 Generation of BigWig files

Normalized BigWig files were generated using Homer (Heinz et al. 2010) makeUCSCfile v4.11.0 with the following parameter ’-norm 1e7’ meaning that data were normalized to 10M reads.

2 Description and statistics of BRG1 dataset

2.1 Load data

Peak files are in narrowPeak format which is of the form (source):

  1. chrom - Name of the chromosome (or contig, scaffold, etc.).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Name given to a region (preferably unique). Use “.” if no name is assigned.
  5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were "‘0"’ when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
  6. strand - +/- to denote strand or orientation (whenever applicable). Use “.” if no orientation is assigned.
  7. signalValue - Measurement of overall (usually, average) enrichment for the region.
  8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
  9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
  10. peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
## The package ChIPseeker provides a function to load peak files such as narrowPeaks as GRanges objects
## Here BRG1 peak set is loaded into a list of peaks
## this list can be extended if there are more datasets
library(ChIPseeker)
## 
## ChIPseeker v1.26.2  For help: https://guangchuangyu.github.io/software/ChIPseeker
## 
## If you use ChIPseeker in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Qing-Yu He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383
peaks <- list()
peaks[["BRG1"]] <- readPeakFile(paste0(params$path.peaks, params$brg1), as="GRanges")

peaks
## $BRG1
## GRanges object with 72024 ranges and 7 metadata columns:
##           seqnames            ranges strand |                     V4        V5
##              <Rle>         <IRanges>  <Rle> |            <character> <integer>
##       [1]     chr1     980402-981178      * | BRG1siCTRL_CHIP-seq_..        77
##       [2]     chr1     983476-984526      * | BRG1siCTRL_CHIP-seq_..       108
##       [3]     chr1   1000729-1001179      * | BRG1siCTRL_CHIP-seq_..        90
##       [4]     chr1   1001762-1002054      * | BRG1siCTRL_CHIP-seq_..        40
##       [5]     chr1   1020914-1021207      * | BRG1siCTRL_CHIP-seq_..       114
##       ...      ...               ...    ... .                    ...       ...
##   [72020]     chrY 18936757-18937116      * | BRG1siCTRL_CHIP-seq_..        37
##   [72021]     chrY 19577774-19578316      * | BRG1siCTRL_CHIP-seq_..        73
##   [72022]     chrY 19891214-19892094      * | BRG1siCTRL_CHIP-seq_..       114
##   [72023]     chrY 19892553-19893192      * | BRG1siCTRL_CHIP-seq_..        55
##   [72024]     chrY 21837647-21838039      * | BRG1siCTRL_CHIP-seq_..        55
##                    V6        V7        V8        V9       V10
##           <character> <numeric> <numeric> <numeric> <integer>
##       [1]           .   5.81321  10.22971   7.71662       166
##       [2]           .   5.97792  13.59661  10.83201       384
##       [3]           .   6.65677  11.62894   9.01201       199
##       [4]           .   3.78100   6.17367   4.02109       209
##       [5]           .   8.11634  14.29797  11.46427       158
##       ...         ...       ...       ...       ...       ...
##   [72020]           .   4.53175   5.93549   3.79936       167
##   [72021]           .   6.32404   9.87728   7.38650       165
##   [72022]           .   7.86030  14.29797  11.46427       711
##   [72023]           .   5.29987   7.83781   5.50942       390
##   [72024]           .   5.55592   7.83781   5.50942       152
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Peaks are stored as GenomicRanges objects; this is an R format which look like the bed format, but is optimized in terms of memory requirements and speed of execution.

We can start by computing some basic statistics on the peak sets.

2.2 How many peaks were called?

Compute the number of peaks per dataset.

# sapply() function takes list, vector or data frame as input and gives output in vector or matrix
# sapply apply the same function (here length) to all elements
# of the list "peaks"
sapply(peaks,length)
##  BRG1 
## 72024

Make a simple barplot showing the number of peaks per chipped protein.

barplot(sapply(peaks,length))

Let’s create a barplot with ggplot2 out of this data.

# Load ggplot2 library
library(ggplot2)

# create a table with the data to display
peak.lengths <- data.frame(IP=names(peaks),
                           NbPeaks=sapply(peaks,length))

# make the barplot
ggplot(peak.lengths, aes(x=IP, y=NbPeaks)) +
         geom_bar(stat="identity")

# Let's add colors to the barplot
# In R it exists some already defined colors palettes
# the most widely used palette is RColorBrewer.
# This R library offers several color palettes
# See:
library(RColorBrewer)
par(mar=c(3,4,2,2))
display.brewer.all()

# now lets add colors to the barplot
# first ass the new information, fill=IP to let ggplot know
# that colors change based on chipped protein
ggplot(peak.lengths, aes(x=IP, y=NbPeaks, fill=IP)) +
         geom_bar(stat="identity")

# if we want to use colors from RColorBrewer library
# with the "Set1" color palette
ggplot(peak.lengths, aes(x=IP, y=NbPeaks, fill=IP)) +
         geom_bar(stat="identity")+
         scale_fill_brewer(palette="Set1")

2.3 How large are these peaks?

Compute statistics on BRG1 peak sizes

## we use the function width() from GenomicRanges
library(GenomicRanges)
summary(width(peaks$BRG1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   254.0   308.0   427.0   578.7   671.0  9186.0

Create a simple boxplot of the peak sizes.

peak.width <- lapply(peaks,width)
boxplot(peak.width)

Now, create a nice looking boxplot with ggplot2. ggplot takes a data frame as input. We can either create a date frame, this is what we have done when we created the barplot. Here, we are going to use a package that creates a data frame from other types of data: reshape2.

# Load the package
library(reshape2)
peak.width.table <- melt(peak.width)
head(peak.width.table)
##   value   L1
## 1   777 BRG1
## 2  1051 BRG1
## 3   451 BRG1
## 4   293 BRG1
## 5   294 BRG1
## 6   351 BRG1
## create boxplot
ggplot(peak.width.table, aes(x=L1, y=value)) +
         geom_boxplot()

Enhance it.

# - theme_classic() change grey background to white background
# - fill=L1 and scale_fill_brewer(palette="Set1") colors boxplots
# based on chipped protein and with colors from RColorBrewer Set1 palette
# - labs changes x and y axis labels and legend title
# - scale_y_log10() set y axis to a log scale so that we can have a nice
# view of the data in small values
ggplot(peak.width.table, aes(x=L1, y=value, fill=L1)) +
         geom_boxplot()+
         theme_classic()+
         scale_fill_brewer(palette="Set1")+
         labs(x = "", y = "Peak sizes", fill = "")+
         scale_y_log10()

2.4 Peak filtering

To make sure we keep only high quality data. We are going to select one those peaks having a qValue >= 8. The qValue corresponds to the 9th column of narrowPeak files. So, we are going to set a threshold on this.

## Select high quality peaks
peaks$BRG1 <- peaks$BRG1[peaks$BRG1$V9 >= 8,]

## Compute the number of remaining peaks
length(peaks$BRG1)
## [1] 30874

2.5 Where are the peaks located over the whole genome?

Sometime, peaks may occur more in some chromosoms than others. We can display the genomic distribution of the peaks along the chromosomes, using the covplot function from ChIPSeeker. Height of peaks is drawn based on the peak scores.

# genome wide BRG1 peak distribution
covplot(peaks$BRG1, weightCol="V5")

# chromosome wide BRG1 peak distribution
covplot(peaks$BRG1, chrs=c("chr1", "chr2"), weightCol="V5")

2.6 Functional annotation: genomic features enriched in BRG1 peaks

2.6.1 Generate annotation

We can assign peaks to the closest genes and genomic features (introns, exons, promoters, distal regions, etc…). This is done using the function annotatePeak which compares peak positions with the genomic feature positions of the reference genome. This function returns a complex object which contains all this information.

## org.Hs.eg.db is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
library(org.Hs.eg.db)

## Annotate peaks for all datasets and store it in a list
## Here TSS regions are regions -1000Kb/+100b arount TSS positions
## Peak annotations are stored in a list
## Load transcript annotation
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

peakAnno <- list()
peakAnno[["BRG1"]] = annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb, annoDb="org.Hs.eg.db")
## >> preparing features information...      2021-05-20 18:07:08 
## >> identifying nearest features...        2021-05-20 18:07:10 
## >> calculating distance from peak to TSS...   2021-05-20 18:07:12 
## >> assigning genomic annotation...        2021-05-20 18:07:12 
## >> adding gene annotation...          2021-05-20 18:07:56 
## >> assigning chromosome lengths           2021-05-20 18:07:56 
## >> done...                    2021-05-20 18:07:56
## Visualize and export annotation as a data table
# as.data.frame(peakAnno$BRG1)
head(as.data.frame(peakAnno$BRG1))
##   seqnames   start     end width strand                          V4  V5 V6
## 1     chr1  983476  984526  1051      *  BRG1siCTRL_CHIP-seq_peak_2 108  .
## 2     chr1 1000729 1001179   451      *  BRG1siCTRL_CHIP-seq_peak_3  90  .
## 3     chr1 1020914 1021207   294      *  BRG1siCTRL_CHIP-seq_peak_5 114  .
## 4     chr1 1069156 1070192  1037      * BRG1siCTRL_CHIP-seq_peak_10  96  .
## 5     chr1 1079517 1080087   571      * BRG1siCTRL_CHIP-seq_peak_11 179  .
## 6     chr1 1304853 1306013  1161      * BRG1siCTRL_CHIP-seq_peak_15 108  .
##        V7       V8       V9 V10
## 1 5.97792 13.59661 10.83201 384
## 2 6.65677 11.62894  9.01201 199
## 3 8.11634 14.29797 11.46427 158
## 4 6.90466 12.34852  9.65604 410
## 5 9.95583 21.25171 17.96153 213
## 6 5.97792 13.59661 10.83201 176
##                                          annotation geneChr geneStart geneEnd
## 1                                 Distal Intergenic       1    975204  982093
## 2                                          Promoter       1   1001138 1014540
## 3 Intron (ENST00000379370.7/375790, intron 1 of 35)       1   1020123 1056118
## 4 Exon (ENST00000412397.2/100288175, exon 10 of 10)       1   1070967 1074306
## 5                                Downstream (1-2kb)       1   1082146 1084072
## 6                                          Promoter       1   1292390 1305929
##   geneLength geneStrand geneId      transcriptId distanceToTSS         ENSEMBL
## 1       6890          2  84808 ENST00000341290.6         -1383 ENSG00000187642
## 2      13403          1   9636 ENST00000624697.4             0 ENSG00000187608
## 3      35996          1 375790 ENST00000620552.4           791 ENSG00000188157
## 4       3340          2 401934 ENST00000453464.3          4114 ENSG00000237330
## 5       1927          2  54991 ENST00000464905.1          3985 ENSG00000131591
## 6      13540          2 116983 ENST00000492936.5             0 ENSG00000131584
##     SYMBOL                                                 GENENAME
## 1    PERM1             PPARGC1 and ESRR induced regulator, muscle 1
## 2    ISG15                            ISG15 ubiquitin like modifier
## 3     AGRN                                                    agrin
## 4   RNF223                                  ring finger protein 223
## 5 C1orf159                      chromosome 1 open reading frame 159
## 6    ACAP3 ArfGAP with coiled-coil, ankyrin repeat and PH domains 3

All the peak information contained in the peak list will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation :

  • Promoter
  • 5’ UTR
  • 3’ UTR
  • Exon
  • Intron
  • Downstream
  • Intergenic
  • Downstream is defined as the downstream of gene end.

This hierachy can be customized using the parameter genomicAnnotationPriority.

annotatePeak report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80),” means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.

Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.

– Reminder: The TxDb class is a container for storing transcript annotations.

  • Bioconductor provides several packages containing TxDb objects for model organisms sur as Human and mouse. For instance, TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc.

  • User can also prepare their own TxDb by retrieving information from UCSC Genome Bioinformatics and BioMart data resources by R function makeTxDbFromBiomart and makeTxDbFromUCSC.

  • One can also create a TxDb objects for his favourite organism using an annotation file in GTF/GFF format using the function makeTxDbFromGFF or the package GenomicFeatures.

Expand to find Coturnix japonica example
## download GTF file
download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/93934/101/GCF_001577835.2_Coturnix_japonica_2.1/GCF_001577835.2_Coturnix_japonica_2.1_genomic.gtf.gz", "Coturnix_japonica_2.1.annotation.gtf.gz")

## Build TxDb object
library(GenomicFeatures)
txdb = makeTxDbFromGFF("Coturnix_japonica_2.1.annotation.gtf.gz")

## To save the txdb database
library(AnnotationDbi)
saveDb(txdb, 'txdb.Coturnix_japonica_2.1.sqlite')

## load it when needed
library(AnnotationDbi)
txdb = loadDb(file = 'txdb.Coturnix_japonica_2.1.sqlite')

2.6.2 Visualize genomic annotation

We can now analyze more in details genomic features associated to our peaks (introns, exons, promoters, distal regions,…).

## distribution of genomic features for BRG1 peaks
# as a pie chart - which is the most widely used representation in publication
plotAnnoPie(peakAnno$BRG1)

# as a barplot
plotAnnoBar(peakAnno$BRG1)

However since some annotation overlap, ChIPseeker provides functions that help having a view of full annotation overlap.

library(UpSetR)
upsetplot(peakAnno$BRG1)

2.6.3 Importance of the annotation database

In the previous section, we have annotated peaks using annotation from UCSC’s knownGene. The database chosen for annotation can have an impact on subsequent results including data integration such as comparison between genes associated to peaks and gene expression using RNA-seq data for instance.

Let’s look at the overlap between genes expressed in RNA-seq data with genes associated to peaks (let’s remind that peaks are associated to closest genes if no other evidences are used).

## Download GTF files
library(GenomicFeatures)
## GTF file downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
txdb.refseq = makeTxDbFromGFF("data/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz")
txdb.ensembl = makeTxDbFromGFF("data/Homo_sapiens.GRCh38.103_UCSC_chr.gtf.gz")

## Ensembl TxDB can also been download using the function makeTxDbFromBiomart
# txdb.ensembl <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
  # host="nov2020.archive.ensembl.org", dataset="hsapiens_gene_ensembl")

annot <- list()
annot[["ensembl"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.ensembl, annoDb="org.Hs.eg.db")
annot[["refseq"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.refseq, annoDb="org.Hs.eg.db")

rnaseq <- read.table(paste0(params$path.rnaseq, params$rnaseq), header=TRUE,
          check.names = FALSE, quote="", sep="\t")

2.7 Heatmaps: visualisation of binding profiles

Heatmap are widely used representation of ChIP-seq data as it allows visualization of read enrichment at various locations at the same time. For instance, one may want to represent read of its chipped protein enrichment +/-5Kb around all TSS of the reference genome or compare read enrichment at the same locations in many chip-seq datasets.

2.7.1 BRG1 peak distribution

We want to know if BRG1 is binding large and/or narrow regions, unique and/or tandem etc.

First of all, for calculating the profile of ChIP peaks binding to the BRG1 center, we need to define peak centers and extend them each side. Then align the peaks that are mapped to these regions, and generate the tagMatrix.

## Load the library
library(GenomicRanges)

# Compute the center of the peaks and attribute it to a new
# column in the metadata of the BRG1 peak GRanges
peaks$BRG1$center.peak <- (start(peaks$BRG1) + end(peaks$BRG1))/2

# For computation and memory efficiency reasons,
# we subset the top 10K peaks according to the FDR column (V9)
top.10000 <- peaks$BRG1[order(peaks$BRG1$V9,decreasing=T)][1:10000]

# Generate peak center GRanges for the 10K top peaks
centers.BRG1 <- GRanges(seqnames(top.10000),
                        IRanges(start=top.10000$center.peak,
                                end=top.10000$center.peak))

# Extend each side of 2000 bp
extended.2K.BRG1 <- centers.BRG1
ranges(extended.2K.BRG1) <- IRanges(start=start(centers.BRG1)-2000,
                                    end=end(centers.BRG1)+2000)

## compute the density of peaks within the promoter regions
tagMatrix <- getTagMatrix(peaks$BRG1, windows=extended.2K.BRG1)

## plot the density
tagHeatmap(tagMatrix, xlim=c(-2000, 2000), color="red")

The regions are ordered relatively to their peak enrichment.

We can visualize a summary of the binding profiles looking at the corresponding average profile. This kind of profiles is much less greedy, we can extend a bit more (e.g. +/- 5000) from the peak centers redoing the previous steps. Definition of the regions have to be redone. Try to create the GRanges and the tagMatrix.

“Click to expand and see the code”
# Extend from the peak center
extended.5K.BRG1 <- centers.BRG1
ranges(extended.5K.BRG1) <- IRanges(start=start(centers.BRG1)-5000,
                                    end=end(centers.BRG1)+5000)

## compute the density of peaks within the promoter regions
tagMatrix <- getTagMatrix(peaks$BRG1, windows=extended.5K.BRG1)


# Plot the profile
plotAvgProf(tagMatrix, xlim=c(-5000, 5000),
            xlab="Distance to peak center", ylab = "Peak Count Frequency")

It looks like BRG1 is having several binding patterns but the binary nature of the signal (presence/absence of peaks) and the ordering of the rows do not allow us to appreciate them.

2.7.2 Read enrichment in BRG1 peaks

For computation and memory efficiency reasons, we are not going to look at read coverage of each position in the regions of interest but they are rather split into non-overlapping equally sized windows. Thus, we need to build a matrix composed of rows that are all BRG1 peaks and columns that contain read enrichment in all bins.

This allows to consider a bigger set of peaks and covered region. We will now analyze the whole set of BRG1 peaks over 10Kb (+/- 5000bp). We will ask for a hundred bins each side of the center resulting in 200 windows of 50 bp.

2.7.2.1 Prepare the signal matrices

We need to load bigwig files for all datasets that we want to visualize. Data are imported using a function from the rtracklayer package.

# # Generate peak center GRanges for all the peaks
centers.BRG1 <- GRanges(seqnames(peaks$BRG1),
                        IRanges(start=peaks$BRG1$center.peak,
                                end=peaks$BRG1$center.peak))

# Extend each side of 5000 bp
extended.5K.BRG1 <- centers.BRG1
ranges(extended.5K.BRG1) <- IRanges(start=start(centers.BRG1)-5000,
                                    end=end(centers.BRG1)+5000)

# load the library
library(rtracklayer)

# load the bw file for BRG1
cvg.BRG1 <- import(file.path(params$path.bw,params$brg1.bw),format="BigWig",
                      which=extended.5K.BRG1,
                      as="RleList")
cvg.BRG1
## RleList of length 25
## $chr1
## numeric-Rle of length 248956422 with 1333289 runs
##   Lengths: 979000      3     48     15     32 ...   1288     73     36  22594
##   Values :   0.00   0.32   0.16   0.49   0.32 ...   0.00   0.17   0.33   0.00
## 
## $chr10
## numeric-Rle of length 133797422 with 510080 runs
##   Lengths: 368603    154     85    154    304 ...     90     58    154 272220
##   Values :   0.00   0.16   0.00   0.16   0.00 ...   0.65   0.00   0.33   0.00
## 
## $chr11
## numeric-Rle of length 135086622 with 542307 runs
##   Lengths: 202802     21    117    221     41 ...      9     70    154  74826
##   Values :   0.00   0.49   0.16   0.00   0.16 ...   0.16   0.00   0.16   0.00
## 
## $chr12
## numeric-Rle of length 133275309 with 564715 runs
##   Lengths: 194505     18    115    154    136 ...     22     68     19 189737
##   Values :   0.00   0.32   0.00   0.16   0.00 ...   0.16   0.65   0.49   0.00
## 
## $chr13
## numeric-Rle of length 114364328 with 338291 runs
##   Lengths: 19776001       19        5       12 ...       10       13    43427
##   Values :     0.00     1.29     1.13     0.97 ...     0.16     0.65     0.00
## 
## ...
## <20 more elements>

Now we can create matrix of read enrichments at the positions of interest. We are using functions from ChIPpeakAnno package.

# load library
library(ChIPpeakAnno)

# featureAlignedSignal needs the coverage (cvg) stored in a list.
cvglist <- list(BRG1=cvg.BRG1)

sig <- featureAlignedSignal(cvglist, centers.BRG1,n.tile=200,
                           upstream=5000, downstream=5000)

dim(sig$BRG1)
## [1] 30874   200

2.7.2.2 Create heatmaps

Let’s draw the heatmaps using the EnrichedHeatmap library.

## Load the library
library(EnrichedHeatmap)

## Create a list of normalizedMatrix that is the input format
## for EnrichedHeatmap
mat1 <- list()
mat1[["BRG1"]] <- as.normalizedMatrix(as.matrix(sig[["BRG1"]]),
    k_upstream = 100,
    k_downstream = 100,
    k_target = 0,
    extend = c(5000, 5000),
    signal_name = names(sig[["BRG1"]]),
    target_name = "Peak centers"
)

## Create the Heatmap with default parameters
EnrichedHeatmap(mat1$BRG1, name = "BRG1")

EnrichedHeatmap combines the average profile and the density heatmap. We can observe a greater precision of the signal around the peak centers. As with ChIPpeakAnno, by default, heatmaps are sorted by read enrichment. However, it would be worth grouping together regions that have similar read enrichment pattern. This can be done using a clustering method such as k-means. This type of clustering requires the number of expected clusters to be set. Moreover, to obtain reproductible clustering results, we need to set a seed.

## define a seed value to get the same results when re-running the analysis
set.seed(123)
## Create Heatmaps with k-means clustering on BRG1 data
## We keep the generated object in order to use the clustering
## information.
heatmap.kmeans <- EnrichedHeatmap(mat1$BRG1, name = "BRG1", row_km = 8,
    column_title = "BRG1", row_title_rot = 0)
## draw the heatmap
draw(heatmap.kmeans)
## Warning: did not converge in 10 iterations

2.7.2.3 Are BRG1 binding profiles associated with particular genomic regions ?

We will use here the clusters obtained with EnrichedHeatmap and look at the genomic distribution of peaks using TxDb.Hsapiens.UCSC.hg38.knownGene and plotAnnoBar.

# Use row_order function to retrieve peaks index belonging to each cluster
clusters <- row_order(heatmap.kmeans)
# Check the class and length of the resulting object
class(clusters)
## [1] "list"
length(clusters)
## [1] 8
# Check what is actually in the list elements
head(clusters[[1]])
## [1]  1815 22265 19335 19294 19967 19336
# Each element of the list contain the indexes of the peaks in the
# original object
# Check cluster sizes
lapply(clusters,length)
## $`1`
## [1] 1226
## 
## $`2`
## [1] 1656
## 
## $`3`
## [1] 1432
## 
## $`4`
## [1] 1485
## 
## $`5`
## [1] 1028
## 
## $`6`
## [1] 5306
## 
## $`7`
## [1] 5574
## 
## $`8`
## [1] 13167
# The following code uses each element of the list (peaks indexes)
# to select corresponding peaks in the original data and return them
BRG1.clusters <- lapply(clusters,function(x,peaks){
  # x represents one element of the list, one set of
  # indexes
  cl <- peaks[x]
  return(cl)
},peaks=peaks$BRG1)

# check names of the resulting list
names(BRG1.clusters)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
# rename the elements
names(BRG1.clusters) <- paste0("cluster",names(clusters))

# Transform the list as GRangesList to be able to use it with the annotatePeak
# function
BRG1.clusters <- as(BRG1.clusters,"GRangesList")

# apply annotatePeak function to each element of the BRG1.clusters GRangesList
peakAnnoList <- lapply(BRG1.clusters, annotatePeak, TxDb=txdb,
                       tssRegion=c(-1000, 100), annoDb="org.Hs.eg.db",
                       verbose=FALSE)

plotAnnoBar(peakAnnoList)

3 Data integration: let’s compare BRG1 to MITF and SOX10

3.1 Dataset description

3.1.1 Load MITF and SOX10 datasets

## Let's load MITF and SOX10 peak sets
peaks[["MITF"]] <- readPeakFile(paste0(params$path.peaks, params$mitf), as="GRanges")
peaks[["SOX10"]] <- readPeakFile(paste0(params$path.peaks, params$sox10), as="GRanges")

3.1.2 How many peaks were called?

Compute the number of datasets.

length(peaks)
## [1] 3
# Load ggplot2 library
library(ggplot2)

peak.lengths <- data.frame(IP=names(peaks),
                           NbPeaks=sapply(peaks,length))

ggplot(peak.lengths, aes(x=IP, y=NbPeaks, fill=IP)) +
         geom_bar(stat="identity")+
         scale_fill_brewer(palette="Set1")

3.1.3 How large are these peaks?

Compute statistics on all peak sizes

peak.width = lapply(peaks,width)

peak.width.table <- melt(peak.width)

In this table, there is as many rows as the total number of peaks in all peak sets. It contains all possible pairs IP <-> number of peaks possible.

# Number of peaks per chipped protein
sapply(peaks, length)
##  BRG1  MITF SOX10 
## 30874  9702  4538
# total number of peaks
sum(sapply(peaks, length))
## [1] 45114
# size of the table we've just generated
dim(peak.width.table)
## [1] 45114     2
ggplot(peak.width.table, aes(x=L1, y=value, fill=L1)) +
         geom_boxplot()+
         theme_classic()+
         scale_fill_brewer(palette="Set1")+
         labs(x = "", y = "Peak sizes", fill = "")+
         scale_y_log10()

3.2 Peak filtering

To make sure we keep only high quality data. We are going to select one those peaks having a qValue >= 4. The qValue corresponds to the 9th column of narrowPeak files. So, we are going to set a threshold on this.

## Select high quality peaks
peaks <- lapply(peaks, function(x){x[x$V9 >= 4,]})

## Compute the number of remaining peaks
sapply(peaks, length)
##  BRG1  MITF SOX10 
## 30874  9279  4214

3.2.1 Peak annotation

peakAnno[["MITF"]] = annotatePeak(peaks$MITF, tssRegion=c(-1000, 100), TxDb=txdb, annoDb="org.Hs.eg.db")
## >> preparing features information...      2021-05-20 18:11:33 
## >> identifying nearest features...        2021-05-20 18:11:33 
## >> calculating distance from peak to TSS...   2021-05-20 18:11:34 
## >> assigning genomic annotation...        2021-05-20 18:11:34 
## >> adding gene annotation...          2021-05-20 18:11:40 
## >> assigning chromosome lengths           2021-05-20 18:11:40 
## >> done...                    2021-05-20 18:11:40
peakAnno[["SOX10"]] = annotatePeak(peaks$SOX10, tssRegion=c(-1000, 100), TxDb=txdb, annoDb="org.Hs.eg.db")
## >> preparing features information...      2021-05-20 18:11:40 
## >> identifying nearest features...        2021-05-20 18:11:40 
## >> calculating distance from peak to TSS...   2021-05-20 18:11:41 
## >> assigning genomic annotation...        2021-05-20 18:11:41 
## >> adding gene annotation...          2021-05-20 18:11:47 
## >> assigning chromosome lengths           2021-05-20 18:11:48 
## >> done...                    2021-05-20 18:11:48
plotAnnoBar(peakAnno)

3.3 Are BRG1, MITF and SOX10 co-localizing ?

3.3.1 Compare BRG1, MITF and SOX10 peak positions (Venn Diagram)

We can evaluate the peak overlapping with a Venn diagram using the ChIPpeakAnno package.

library(ChIPpeakAnno)
# We first compute the overlap between peak sets, keeping the information
# of all peaks overlapping in each set (see ?findOverlapsOfPeaks for help)
ovl <- findOverlapsOfPeaks(peaks, connectedPeaks="keepAll")

ChIPpeakAnno imposes, while plotting the Venn diagram, to compute the significance of the pairwise associations using a hypergeometric test. To this end, we need to estimate the number of all potential binding events which is used by the makeVennDiagram function through the totalTest number. It is used for the hypergeometric sampling that is used to determine if the overlap between two datasets is more than would be expected by chance. This is not a trivial question, the answer is driven by what you know about the binding properties of your factors (eg. sequence specific, mainly intergenic etc). You can find an interesting discussion here. In our case we can refer to the genomic distribution of the peaks that we have plotted previously. We can assume that our TFs have a gene body binding preference. Genes cover roughly 10% of the genome.

# Estimate the average size of the peaks ...
averagePeakWidth <- mean(width(unlist(GRangesList(ovl$peaklist))))

# ... to count how many potential sites could have been bound in coding regions.
tot <- ceiling(3.3e+9 * 0.1 / averagePeakWidth)

We will define the colors attributed to each set using the function colours(). In any case you want to set a color you can use this function.

fill.colors <- c(BRG1=colours()[373],MITF=colours()[591],SOX10=colours()[102])
circle.font.colors <- c(BRG1=colours()[35],MITF=colours()[618],SOX10=colours()[614])

makeVennDiagram(ovl, totalTest=tot, connectedPeaks="keepAll",
                fill=fill.colors, # circle fill color
                col=circle.font.colors, #circle border color
                cat.col=circle.font.colors)

## $p.value
##      BRG1 MITF SOX10          pval
## [1,]    0    1     1 7.824528e-215
## [2,]    1    0     1  0.000000e+00
## [3,]    1    1     0  0.000000e+00
## 
## $vennCounts
##      BRG1 MITF SOX10 Counts count.BRG1 count.MITF count.SOX10
## [1,]    0    0     0 411865          0          0           0
## [2,]    0    0     1   1127          0          0        1127
## [3,]    0    1     0   6930          0       6930           0
## [4,]    0    1     1     40          0         40          40
## [5,]    1    0     0  24781      24781          0           0
## [6,]    1    0     1   2547       3491          0        2572
## [7,]    1    1     0   1784       1907       1822           0
## [8,]    1    1     1    454        695        487         475
## attr(,"class")
## [1] "VennCounts"

According to the hypergeometric test p-values all pairwise comparisons are highly significant (=0). This has to be taken very carefully as it depends largely on the background estimation that we may have over-estimated. If you are not sure about your estimation, you will prefer to use a non-parametric approach based on your peak genomic distribution to estimate randomness. To this end the TxDb.Hsapiens.UCSC.hg38.knownGene is used.

3.3.1.1 Is the overlap significant?

ChIPpeakAnno provides the preparePool and peakPermTest functions to compute this test. These tests are made by pair, we will look at SOX10/MITF overlap significance as an example.

# Prepare a pool of random peaks following the characteristics of our peak sets
pool <- preparePool(txdb,peaks$SOX10,bindingType="TSS",featureType="transcript",seqn=paste0("chr",c(1:22,"X","Y")))
# Create the permPool object needed for peakPermTest
pool <- new("permPool",grs=pool$grs[1],N=length(peaks$SOX10))
SOX10.MITF <- peakPermTest(peaks$SOX10, peaks$MITF, pool=pool, seed=1, force.parallel=FALSE)
SOX10.MITF
## $cntOverlaps
## P-value: 0.0099009900990099
## Z-score: 139.5137
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 437
## Evaluation function: cntOverlaps
## Randomization function: randPeaks
## 
## attr(,"class")
## [1] "permTestResultsList"
plot(SOX10.MITF)

Venn diagrams are widely used to represent overlaps, intersections. However, in ChIP-seq analysis, the definition of the peaks is dependent on the peak caller, the FDR thresholds (what about peaks just bellow the threshold). The overlap is also difficult to assess, indeed do we call an overlap a 1 nucleotide an intersection ? It is one of the numerous parameters that can be tuned …

Are different combinations of TFs bind specific genomic regions ?

coocs <- as(ovl$peaklist,"GRangesList")
peakAnnoList <- lapply(coocs, annotatePeak, TxDb=txdb,
                        tssRegion=c(-1000, 100), annoDb="org.Hs.eg.db",
                        verbose=FALSE)
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
plotAnnoBar(peakAnnoList)

plotDistToTSS(peakAnnoList)

3.3.2 Heatmaps / Profiles

One way to circumvent hard thresholds and relatively arbitrary choices, we can choose to use heatmap and average profile representations. To this end, we need to define the reference point of view. We use the BRG1 peak centers.

Load bigwig file for MITF and SOX10 at BRG1 peaks.

cvglist$MITF <- import(file.path(params$path.bw,params$mitf.bw),format="BigWig",
                      which=extended.5K.BRG1,
                      as="RleList")
cvglist$SOX10 <- import(file.path(params$path.bw,params$sox10.bw),format="BigWig",
                      which=extended.5K.BRG1,
                      as="RleList")

Prepare the matrices binned in 50bp windows.

sig <- featureAlignedSignal(cvglist, centers.BRG1,
                           upstream=5000, downstream=5000,n.tile=200)

lapply(sig, dim)
## $BRG1
## [1] 30874   200
## 
## $MITF
## [1] 30874   200
## 
## $SOX10
## [1] 30874   200

Create normalized matrices.

## Create a list of normalizedMatrix that is the input format
## for EnrichedHeatmap

BRG1.mat <- lapply(sig, function(x){
  # x represent each element of the sig list
  mat <- as.normalizedMatrix(as.matrix(x),
                      k_upstream = 100,
                      k_downstream = 100,
                      k_target = 0,
                      extend = c(5000, 5000),
                      #signal_name = names(sig[["MITF"]]),
                      target_name = "Peak center"
                      )
  return(mat)
})

Create heatmap with K-means clustering.

## Create the Heatmap with default parameters
multi.heatmap <- EnrichedHeatmap(BRG1.mat$BRG1, name = "BRG1", row_km = 8,
    column_title = "BRG1", row_title_rot = 0) +
EnrichedHeatmap(BRG1.mat$MITF, name = "MITF",
    column_title = "MITF") +
EnrichedHeatmap(BRG1.mat$SOX10, name = "SOX10",
    column_title = "SOX10")

htlist <- draw(multi.heatmap)

Let’s enhance it!

library(circlize)
# we retrieve the cluster information from the previous command
clusters <- row_order(htlist)
# rename the clusters
names(clusters) <- paste0("cluster",names(clusters))
# transform to a vector
clusters <- unlist(clusters)
head(clusters)
## cluster11 cluster12 cluster13 cluster14 cluster15 cluster16 
##      1815     22265     19335     19294     19967     19336
# names of the elements were extended with a number
# We thus trim them to retrieve the cluster name
names(clusters) <- substring(first = 1,last = 8,text=names(clusters))

# The numbers are the indexes of the rows, we need to
# sort the indexes to get the right order of cluster
# labels
partition <- names(clusters)[order(clusters)]
# Define new colors for each heatmap
col_sox10 <-colorRamp2(c(0,2,3), c("white", "blue","black"))
col_mitf <- colorRamp2(c(0,4,5), c("white", "blue","black"))
col_brg1 <- colorRamp2(c(0,5,6), c("white", "blue","black"))
# create a legend for the cluster labels
lgd <- Legend(at = c("cluster1", "cluster2", "cluster3", "cluster4","cluster5","cluster6","cluster7","cluster8"),
    title = "Clusters",
    type = "lines", legend_gp = gpar(col = 2:9))

# Add a first column containing the cluster assignment
ht_list <- Heatmap(partition, col = structure(2:9, names = paste0("cluster", 1:8)), name = "partition", show_row_names = FALSE, width = unit(3, "mm")) +
EnrichedHeatmap(BRG1.mat$BRG1, name = "BRG1", col=col_brg1,
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:9))),
  column_title = "BRG1") +
EnrichedHeatmap(BRG1.mat$MITF, name = "MITF",
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:9))),
  column_title = "MITF", row_title_rot = 0, col=col_mitf) +
EnrichedHeatmap(BRG1.mat$SOX10, name = "SOX10",
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:9))),
  column_title = "SOX10", row_title_rot = 0, col=col_sox10)

draw(ht_list, split = partition, annotation_legend_list = list(lgd),
    ht_gap = unit(c(2, 8, 8,8), "mm"))

3.4 Is gene expression influenced by TF binding to their promoters ?

Select genes having BRG1 binding close to their TSS. To do that we will plot a heatmap centered on TSS.

# load the library
library(GenomicFeatures)

# Generate TSS and promoter GRanges, the function promoters allows to retrieve the
# gene ID (entrez)
TSS <- GenomicFeatures::promoters(txdb, upstream=0, downstream=0, columns="gene_id")
TSS.extended <- GenomicFeatures::promoters(txdb, upstream=2000, downstream=2000, columns="gene_id")
TSS.extended
## GRanges object with 247541 ranges and 1 metadata column:
##                             seqnames        ranges strand |         gene_id
##                                <Rle>     <IRanges>  <Rle> | <CharacterList>
##   ENST00000456328.2             chr1    9869-13868      + |       100287102
##   ENST00000450305.2             chr1   10010-14009      + |       100287102
##   ENST00000473358.1             chr1   27554-31553      + |                
##   ENST00000469289.1             chr1   28267-32266      + |                
##   ENST00000607096.1             chr1   28366-32365      + |       100302278
##                 ...              ...           ...    ... .             ...
##   ENST00000619779.1 chrUn_GL000220v1 153997-157996      + |       109864274
##   ENST00000620265.1 chrUn_KI270442v1 378608-382607      + |                
##   ENST00000611690.1 chrUn_KI270442v1 215402-219401      - |                
##   ENST00000616830.1 chrUn_KI270744v1   49115-53114      - |                
##   ENST00000612925.1 chrUn_KI270750v1 146668-150667      + |                
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
# The column gene_id is of type characterList, this type is difficult to
# manipulate. We then decide to transform it to a character vector
gene_ids <- unlist(lapply(TSS$gene_id,function(x){
  if (length(x)==0){
     NA
  }
  else {
    x
  }
} ))

# Keep TSS on selected chromosomes annotated with a gene ID and remove
# duplicate IDs due to transcript isoforms. Here, the selection of the
# isoform is random.
TSS <- TSS[as.vector(seqnames(TSS))%in%paste0("chr",c(1:22,"X","Y")) & !is.na(gene_ids) & !duplicated(gene_ids)]
TSS.extended <- TSS.extended[as.vector(seqnames(TSS.extended))%in%paste0("chr",c(1:22,"X","Y")) & !is.na(gene_ids) & !duplicated(gene_ids)]

# load the bw file for all TFs within the promoter
cvglist <- list()
cvglist$BRG1 <- import(file.path(params$path.bw,params$brg1.bw),format="BigWig",
                      which=TSS.extended,
                      as="RleList")
cvglist$MITF <- import(file.path(params$path.bw,params$mitf.bw),format="BigWig",
                      which=TSS.extended,
                      as="RleList")
cvglist$SOX10 <- import(file.path(params$path.bw,params$sox10.bw),format="BigWig",
                      which=TSS.extended,
                      as="RleList")

# Produce the signal matrices
sig <- featureAlignedSignal(cvglist, TSS,n.tile=400,
                           upstream=2000, downstream=2000)


# Transform the signal matrices as normalizedMatrix
TSS.mat <- lapply(sig, function(x){
  # x represent each element of the sig list
  mat <- as.normalizedMatrix(as.matrix(x),
                      k_upstream = 200,
                      k_downstream = 200,
                      k_target = 0,
                      extend = c(2000, 2000),
                      target_name = "TSS"
                      )
  return(mat)
})

# Cmpute a partition using the kmeans function, asking for 5 clusters
set.seed(20210526)
partition = paste0("cluster", kmeans(TSS.mat$BRG1, centers = 5)$cluster)

# Specify colors for the position enrichment for each matrix
col_sox10 = colorRamp2(c(0,3,4), c("white", "blue","black"))
col_mitf = colorRamp2(c(0,4,5), c("white", "blue","black"))
col_brg1 = colorRamp2(c(0,5,6), c("white", "blue","black"))
lgd = Legend(at = c("cluster1", "cluster2", "cluster3", "cluster4","cluster5"),
    title = "Clusters",
    type = "lines", legend_gp = gpar(col = 2:6))


ht_list = Heatmap(partition, col = structure(2:7, names = paste0("cluster", 1:5)), name = "partition", show_row_names = FALSE, width = unit(3, "mm")) +
EnrichedHeatmap(TSS.mat$BRG1, name = "BRG1", col=col_brg1,
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:6))),
  column_title = "BRG1") +
EnrichedHeatmap(TSS.mat$MITF, name = "MITF",
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:6))),
  column_title = "MITF", row_title_rot = 0, col=col_mitf) +
EnrichedHeatmap(TSS.mat$SOX10, name = "SOX10",
  top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:6))),
  column_title = "SOX10", row_title_rot = 0, col=col_sox10)

draw(ht_list, split = partition, annotation_legend_list = list(lgd),
    ht_gap = unit(c(2, 8, 8,8), "mm"))

Lets’ integrate expression data from RNA-seq experiment. RNA-seq data were preprocessed, normalized and called for differentially expressed genes.

The goal of the following code is to visualise the gene expression according to the cluster the TSS belong. To this end, we will need to play with identifiers from different databases.

# load RNA-seq normalized expression data
load(params$rnaseq) # RNAseq

head(RNAseq)
##   Ensembl.Gene.ID    WT.norm shBGR1.norm   logCPM      logFC          FDR
## 1 ENSG00000000003  722.38046   119.10483 4.387012  2.5748948 2.233826e-21
## 2 ENSG00000000419 1481.30280  2514.73393 6.689308 -0.6994493 1.391964e-13
## 3 ENSG00000000457  259.21973   314.27718 5.364018 -0.3167911 3.091766e-05
## 4 ENSG00000000460  333.22470   380.48495 5.562143 -0.1954216 2.875151e-03
## 5 ENSG00000000971   13.42921    13.59959 0.341927  0.6570620 2.536438e-02
## 6 ENSG00000001036 1157.52492  1521.75139 5.567526 -0.3587066 3.117933e-06
##   signif
## 1 stable
## 2 stable
## 3 stable
## 4 stable
## 5 stable
## 6 stable
# Convert The ENSEMBL IDs from the RNAseq into ENTREZID to be able to
# map the clusters from the heatmap with the corresponding expression.
# We use an OrgDB object for the human genome
# the available information are displayed using columns()
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
## [26] "UNIPROT"
conv <- mapIds(x=org.Hs.eg.db,
              keys=RNAseq$Ensembl.Gene.ID, # what do we want to be mapped
              column="ENTREZID", # which type of ID we want
              keytype="ENSEMBL") # what type of ID we give
## 'select()' returned 1:many mapping between keys and columns
head(conv)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000971 
##          "7105"          "8813"         "57147"         "55732"          "3075" 
## ENSG00000001036 
##          "2519"
# names of the vector are ENSEMBL IDs and elements are ENTREZ IDs
# The match function allows to return match between the IDs in RNAseq
# and the ENSEMBL IDs from the conversion vector
m <- match(RNAseq$Ensembl.Gene.ID,names(conv))
head(m)
## [1] 1 2 3 4 5 6
# m contains NA for element in the first vector not present in the 2nd
# or the index of the element of the 2nd vector corresponding the those
# in the first
# Attribute the matched ENTREZ ID
RNAseq$ENTREZID <- conv[m]


# We add the partition (clusters) to the TSS object in order to add the
# clusters to the RNA-seq object
TSS$cluster <- partition

# The common IDs this time are the ENTREZ IDs
m <- match(RNAseq$ENTREZID,unlist(TSS$gene_id))
RNAseq$cluster <- TSS$cluster[m]

# visualise distributions with ggplot2
library(ggplot2)

# distribution of gene expression in clusters using boxplots
# in wild type cell
ggplot(RNAseq,aes(x=cluster,y=log2(WT.norm+1))) +
  geom_boxplot() +
  theme_bw()

# in cell with a BRG1 KD
ggplot(RNAseq,aes(x=cluster,y=log2(shBGR1.norm+1))) +
  geom_boxplot() +
  theme_bw()

# Distribution of differentially expressed genes as barplots
ggplot(RNAseq,aes(x=cluster)) +
  geom_bar(aes(fill=signif),position = "fill",color="black") +
  theme_bw() +
  scale_fill_manual(values=c(up="red",down="blue",stable="white"))

3.5 Are the differentially expressed genes with different BRG1 binding patterns have different biological function ?

library(clusterProfiler)
geneList <- RNAseq[RNAseq$signif!="stable",]
geneList <- split(geneList$ENTREZID,geneList$cluster)
compKEGG <- compareCluster(geneCluster   = geneList,
                         fun           = "enrichKEGG",
                         pvalueCutoff  = 0.05,
                         pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")

3.5.1 Are there published data I could use to enrich my analysis ?

  • Data Mining with ChIP seq data deposited in GEO with ChIPSeeker

There are many ChIP seq data sets that have been published and deposited in GEO database. We can compare our own dataset to those deposited in GEO to search for significant overlap data. Significant overlap of ChIP seq data by different binding proteins may be used to infer cooperative regulation and thus can be used to generate hypotheses.

We collect about 17,000 bed files deposited in GEO, user can use getGEOspecies to get a summary based on spieces.

getGEOspecies()
##                                               species  Freq
## 1                                       Aedes aegypti    11
## 2                                            Anabaena     6
## 3                                 Anolis carolinensis     5
## 4                                   Anopheles gambiae     2
## 5                                      Apis mellifera     5
## 6                           Apis mellifera scutellata     1
## 7                                  Arabidopsis lyrata     4
## 8                                Arabidopsis thaliana   288
## 9                                Atelerix albiventris     2
## 10                                         Bos taurus    37
## 11                          Branchiostoma lanceolatum    62
## 12                                      Brassica rapa    12
## 13                             Caenorhabditis elegans   189
## 14                                   Candida albicans    25
## 15                               Candida dubliniensis    20
## 16                             Canis lupus familiaris     7
## 17                          Chlamydomonas reinhardtii    51
## 18                               Chlorocebus aethiops     2
## 19                                 Cleome hassleriana     1
## 20                                      Columba livia     6
## 21                                  Crassostrea gigas     1
## 22                            Cryptococcus neoformans    51
## 23                                    Cyprinus carpio    40
## 24                                        Danio rerio   308
## 25                                 Drosophila busckii     1
## 26                            Drosophila melanogaster  1069
## 27            Drosophila melanogaster;\tSindbis virus     3
## 28                                 Drosophila miranda     2
## 29                           Drosophila pseudoobscura     7
## 30                                Drosophila simulans    12
## 31                                 Drosophila virilis    26
## 32                              Drosophila willistoni     1
## 33                                  Drosophila yakuba     8
## 34                                     Equus caballus     1
## 35                                   Escherichia coli    15
## 36                           Escherichia coli BW25113     4
## 37                              Escherichia coli K-12     2
## 38          Escherichia coli str. K-12 substr. MG1655    16
## 39                                      Gallus gallus    58
## 40                       Geobacter sulfurreducens PCA     3
## 41                                    Gorilla gorilla     2
## 42                                  Histophilus somni     1
## 43                                       Homo sapiens 29978
## 44                 Homo sapiens;\tHuman herpesvirus 8     6
## 45                               Human herpesvirus 6B     2
## 46                                Human herpesvirus 8     6
## 47                                Larimichthys crocea     7
## 48                             Legionella pneumophila     5
## 49                             Leishmania amazonensis     4
## 50                                   Leishmania major     2
## 51                   Leishmania major strain Friedlin     4
## 52                              Leishmania tarentolae    15
## 53                                     Macaca mulatta   120
## 54                              Monodelphis domestica     8
## 55                         Moraxella catarrhalis O35E     6
## 56                                                Mus     2
## 57                                       Mus musculus 16748
## 58                         Mus musculus x Mus spretus     1
## 59                         Mycobacterium tuberculosis     2
## 60                                    Myotis brandtii    15
## 61                              Naumovozyma castellii     1
## 62                             Nematostella vectensis    23
## 63                              Oreochromis niloticus     1
## 64                           Ornithorhynchus anatinus     5
## 65                                       Oryza sativa    30
## 66                                    Oryzias latipes     2
## 67                                    Pan troglodytes    93
## 68                                       Papio anubis     1
## 69                              Plasmodium falciparum   129
## 70                          Plasmodium falciparum 3D7    29
## 71                          Pseudomonas putida KT2440     2
## 72                                 Pseudozyma aphidis    11
## 73                                Pyrococcus furiosus     4
## 74                                  Rattus norvegicus   108
## 75                         Rhodopseudomonas palustris     6
## 76                  Rhodopseudomonas palustris CGA009     3
## 77                           Saccharomyces cerevisiae   813
## 78 Saccharomyces cerevisiae x Saccharomyces paradoxus    16
## 79            Saccharomyces cerevisiae;\tMus musculus    12
## 80                         Saccharomyces kudriavzevii     1
## 81                            Saccharomyces paradoxus     8
## 82                               Saccharomyces uvarum     1
## 83                      Schizosaccharomyces japonicus     2
## 84                          Schizosaccharomyces pombe   179
## 85                             Schmidtea mediterranea     7
## 86                               Solanum lycopersicum     2
## 87                                    Sorghum bicolor     2
## 88                              Spodoptera frugiperda    16
## 89                      Streptomyces coelicolor A3(2)     6
## 90                                         Sus scrofa    41
## 91                                Taeniopygia guttata     1
## 92                                   Tupaia chinensis     7
## 93                                    Vibrio cholerae     8
## 94                      Xenopus (Silurana) tropicalis    62
## 95                                     Xenopus laevis    10
## 96                                 Xenopus tropicalis    74
## 97                                           Zea mays    65

User can access the detail information by getGEOInfo, for each genome version.

hg38 <- getGEOInfo(genome="hg38", simplify=TRUE)
head(hg38)
##       series_id        gsm     organism                                 title
## 16488  GSE58207 GSM1403308 Homo sapiens         Lactimidomycin treated HCT116
## 16489  GSE58207 GSM1403308 Homo sapiens         Lactimidomycin treated HCT116
## 16490  GSE58207 GSM1403307 Homo sapiens          Cycloheximide treated HCT116
## 16491  GSE58207 GSM1403307 Homo sapiens          Cycloheximide treated HCT116
## 20345  GSE67978 GSM1660032 Homo sapiens   H3K27ac_Human_Brain_WhiteMatter_HS2
## 20351  GSE67978 GSM1660029 Homo sapiens H3K27ac_Human_Brain_OccipitalPole_HS2
##                                                                                                                                 supplementary_file
## 16488                                   ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403308/suppl/GSM1403308_HCT116_LTM_sense.bedGraph.gz
## 16489                              ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403308/suppl/GSM1403308_HCT116_LTM_anti-sense.bedGraph.gz
## 16490                              ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403307/suppl/GSM1403307_HCT116_CHX_anti-sense.bedGraph.gz
## 16491                                   ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403307/suppl/GSM1403307_HCT116_CHX_sense.bedGraph.gz
## 20345   ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1660nnn/GSM1660032/suppl/GSM1660032_H3K27ac_Human_Brain_WhiteMatter_HS2_hg38_peaks.narrowPeak.gz
## 20351 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1660nnn/GSM1660029/suppl/GSM1660029_H3K27ac_Human_Brain_OccipitalPole_HS2_hg38_peaks.narrowPeak.gz
##       genomeVersion pubmed_id
## 16488          hg38      <NA>
## 16489          hg38      <NA>
## 16490          hg38      <NA>
## 16491          hg38      <NA>
## 20345          hg38      <NA>
## 20351          hg38      <NA>

ChIPseeker provide function downloadGEObedFiles to download all the bed files of a particular genome.

downloadGEObedFiles(genome="hg38", destDir="hg38")

Or a vector of GSM accession number by downloadGSMbedFiles.

gsm <- hg38$gsm[sample(nrow(hg38), 10)]
downloadGSMbedFiles(gsm, destDir="hg38")

After download the bed files from GEO, we can pass them to enrichPeakOverlap for testing the significant of overlap. Parameter targetPeak can be the folder, e.g. hg19, that containing bed files. enrichPeakOverlap will parse the folder and compare all the bed files. It is possible to test the overlap with bed files that are mapping to different genome or different genome versions, enrichPeakOverlap provide a parameter chainFile that can pass a chain file and liftOver the targetPeak to the genome version consistent with queryPeak. Signifcant overlap can be use to generate hypothesis of cooperative regulation.By mining the data deposited in GEO, we can identify some putative complex or interacted regulators in gene expression regulation or chromosome remodelling for further validation.

  • The (mod)ENCODE project link

  • The UCSC genome browser link

4 Session info

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] clusterProfiler_3.18.1                  
##  [2] circlize_0.4.12                         
##  [3] EnrichedHeatmap_1.20.0                  
##  [4] ComplexHeatmap_2.6.2                    
##  [5] ChIPpeakAnno_3.24.2                     
##  [6] rtracklayer_1.50.0                      
##  [7] UpSetR_1.4.0                            
##  [8] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
##  [9] GenomicFeatures_1.42.3                  
## [10] org.Hs.eg.db_3.12.0                     
## [11] AnnotationDbi_1.52.0                    
## [12] Biobase_2.50.0                          
## [13] reshape2_1.4.4                          
## [14] GenomicRanges_1.42.0                    
## [15] GenomeInfoDb_1.26.7                     
## [16] IRanges_2.24.1                          
## [17] S4Vectors_0.28.1                        
## [18] BiocGenerics_0.36.1                     
## [19] RColorBrewer_1.1-2                      
## [20] ggplot2_3.3.3                           
## [21] ChIPseeker_1.26.2                       
## [22] knitr_1.33                              
## 
## loaded via a namespace (and not attached):
##   [1] shadowtext_0.0.8                       
##   [2] fastmatch_1.1-0                        
##   [3] BiocFileCache_1.14.0                   
##   [4] plyr_1.8.6                             
##   [5] igraph_1.2.6                           
##   [6] lazyeval_0.2.2                         
##   [7] splines_4.0.5                          
##   [8] BiocParallel_1.24.1                    
##   [9] digest_0.6.27                          
##  [10] ensembldb_2.14.1                       
##  [11] htmltools_0.5.1.1                      
##  [12] GOSemSim_2.16.1                        
##  [13] viridis_0.6.0                          
##  [14] GO.db_3.12.1                           
##  [15] fansi_0.4.2                            
##  [16] magrittr_2.0.1                         
##  [17] memoise_2.0.0                          
##  [18] BSgenome_1.58.0                        
##  [19] cluster_2.1.2                          
##  [20] Biostrings_2.58.0                      
##  [21] graphlayouts_0.7.1                     
##  [22] matrixStats_0.58.0                     
##  [23] askpass_1.1                            
##  [24] enrichplot_1.10.2                      
##  [25] prettyunits_1.1.1                      
##  [26] colorspace_2.0-1                       
##  [27] blob_1.2.1                             
##  [28] rappdirs_0.3.3                         
##  [29] ggrepel_0.9.1                          
##  [30] xfun_0.22                              
##  [31] dplyr_1.0.6                            
##  [32] crayon_1.4.1                           
##  [33] RCurl_1.98-1.3                         
##  [34] jsonlite_1.7.2                         
##  [35] graph_1.68.0                           
##  [36] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [37] scatterpie_0.1.6                       
##  [38] survival_3.2-11                        
##  [39] glue_1.4.2                             
##  [40] polyclip_1.10-0                        
##  [41] gtable_0.3.0                           
##  [42] zlibbioc_1.36.0                        
##  [43] XVector_0.30.0                         
##  [44] GetoptLong_1.0.5                       
##  [45] DelayedArray_0.16.3                    
##  [46] shape_1.4.5                            
##  [47] scales_1.1.1                           
##  [48] DOSE_3.16.0                            
##  [49] futile.options_1.0.1                   
##  [50] DBI_1.1.1                              
##  [51] Rcpp_1.0.6                             
##  [52] plotrix_3.8-1                          
##  [53] viridisLite_0.4.0                      
##  [54] progress_1.2.2                         
##  [55] clue_0.3-59                            
##  [56] bit_4.0.4                              
##  [57] httr_1.4.2                             
##  [58] fgsea_1.16.0                           
##  [59] gplots_3.1.1                           
##  [60] ellipsis_0.3.2                         
##  [61] pkgconfig_2.0.3                        
##  [62] XML_3.99-0.6                           
##  [63] farver_2.1.0                           
##  [64] sass_0.3.1                             
##  [65] dbplyr_2.1.1                           
##  [66] locfit_1.5-9.4                         
##  [67] utf8_1.2.1                             
##  [68] tidyselect_1.1.1                       
##  [69] labeling_0.4.2                         
##  [70] rlang_0.4.11                           
##  [71] munsell_0.5.0                          
##  [72] tools_4.0.5                            
##  [73] cachem_1.0.4                           
##  [74] downloader_0.4                         
##  [75] generics_0.1.0                         
##  [76] RSQLite_2.2.7                          
##  [77] evaluate_0.14                          
##  [78] stringr_1.4.0                          
##  [79] fastmap_1.1.0                          
##  [80] yaml_2.2.1                             
##  [81] bit64_4.0.5                            
##  [82] tidygraph_1.2.0                        
##  [83] caTools_1.18.2                         
##  [84] purrr_0.3.4                            
##  [85] AnnotationFilter_1.14.0                
##  [86] KEGGREST_1.30.1                        
##  [87] ggraph_2.0.5                           
##  [88] RBGL_1.66.0                            
##  [89] formatR_1.9                            
##  [90] DO.db_2.9                              
##  [91] xml2_1.3.2                             
##  [92] biomaRt_2.46.3                         
##  [93] compiler_4.0.5                         
##  [94] curl_4.3.1                             
##  [95] png_0.1-7                              
##  [96] tibble_3.1.1                           
##  [97] tweenr_1.0.2                           
##  [98] bslib_0.2.4                            
##  [99] stringi_1.5.3                          
## [100] futile.logger_1.4.3                    
## [101] highr_0.9                              
## [102] lattice_0.20-44                        
## [103] ProtGenerics_1.22.0                    
## [104] Matrix_1.3-3                           
## [105] multtest_2.46.0                        
## [106] vctrs_0.3.8                            
## [107] pillar_1.6.0                           
## [108] lifecycle_1.0.0                        
## [109] BiocManager_1.30.14                    
## [110] GlobalOptions_0.1.2                    
## [111] jquerylib_0.1.4                        
## [112] data.table_1.14.0                      
## [113] cowplot_1.1.1                          
## [114] bitops_1.0-7                           
## [115] qvalue_2.22.0                          
## [116] R6_2.5.0                               
## [117] KernSmooth_2.23-20                     
## [118] gridExtra_2.3                          
## [119] codetools_0.2-18                       
## [120] lambda.r_1.2.4                         
## [121] boot_1.3-28                            
## [122] MASS_7.3-54                            
## [123] gtools_3.8.2                           
## [124] assertthat_0.2.1                       
## [125] SummarizedExperiment_1.20.0            
## [126] rjson_0.2.20                           
## [127] openssl_1.4.4                          
## [128] withr_2.4.2                            
## [129] regioneR_1.22.0                        
## [130] GenomicAlignments_1.26.0               
## [131] Rsamtools_2.6.0                        
## [132] GenomeInfoDbData_1.2.4                 
## [133] hms_1.0.0                              
## [134] VennDiagram_1.6.20                     
## [135] ggupset_0.3.0                          
## [136] tidyr_1.1.3                            
## [137] rmarkdown_2.8                          
## [138] rvcheck_0.1.8                          
## [139] MatrixGenerics_1.2.1                   
## [140] Cairo_1.5-12.2                         
## [141] ggforce_0.3.3

References

“(2014) Mod/Mouse/humanENCODE: Blacklisted Genomic Regions for Functional Genomics Analysis - Anshul Kundaje.” n.d. Accessed August 26, 2016. https://sites.google.com/site/anshulkundaje/projects/blacklists.
Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. 2015. HTSeq—a Python Framework to Work with High-Throughput Sequencing Data.” Bioinformatics 31 (2): 166–69. https://doi.org/10.1093/bioinformatics/btu638.
Anders, and Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11.
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Heinz, Sven, Christopher Benner, Nathanael Spann, Eric Bertolino, Yin C. Lin, Peter Laslo, Jason X. Cheng, Cornelis Murre, Harinder Singh, and Christopher K. Glass. 2010. “Simple Combinations of Lineage-Determining Transcription Factors Prime Cis-Regulatory Elements Required for Macrophage and b Cell Identities.” Molecular Cell 38 (4): 576–89. https://doi.org/10.1016/j.molcel.2010.05.004.
Langmead, Ben, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biology 10 (3): R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Laurette, Patrick, Thomas Strub, Dana Koludrovic, Céline Keime, Stéphanie Le Gras, Hannah Seberg, Eric Van Otterloo, et al. 2015. “Transcription Factor MITF and Remodeller Brg1 Define Chromatin Organisation at Regulatory Elements in Melanoma Cells.” Edited by Michael R Green. eLife 4 (March): e06857. https://doi.org/10.7554/eLife.06857.
Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet.journal 17 (1): pp. 10–12. http://journal.embnet.org/index.php/embnetjournal/article/view/200.
Quinlan, Aaron R., and Ira M. Hall. 2010. BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features.” Bioinformatics 26 (6): 841–42. https://doi.org/10.1093/bioinformatics/btq033.